Highly Scalable Discriminative Spam Filtering
Abstract
This paper discusses several lessons learned from the SpamTREC 2006 challenge, including issues related to decoding, preprocessing, and tokenization of email messages. Using the Winnow algorithm with orthogonal sparse bigram features, we construct an efficient, highly scalable incremental classifier trained to maximize a discriminative optimization criterion. The algorithm easily scales to millions of training messages and millions of features. We address the composition of training corpora and discuss the experiments that guided the construction of our SpamTREC entry. We describe our submission for the filtering tasks, which uses periodic retraining and active learning strategies, and report evaluation results on publicly available corpora.
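The abstract names two building blocks, orthogonal sparse bigram (OSB) features and the mistake-driven Winnow update, without giving details. The following is a minimal sketch of how the two might fit together, not the authors' implementation. The names `osb_features` and `Winnow`, the window size, the promotion and demotion factors, and the use of a mean-weight threshold of 1.0 are all illustrative assumptions.

```python
def osb_features(tokens, window=5):
    """Orthogonal sparse bigrams: pair each token with each of the next
    window-1 tokens and record how many tokens were skipped in between."""
    feats = set()
    for i, left in enumerate(tokens):
        for gap in range(1, window):
            j = i + gap
            if j >= len(tokens):
                break
            feats.add((left, gap - 1, tokens[j]))  # (first, skipped, second)
    return feats


class Winnow:
    """Mistake-driven multiplicative classifier over sparse binary features.
    Hypothetical parameters; the paper's exact variant is not specified here."""

    def __init__(self, promotion=1.23, demotion=0.83):
        self.weights = {}  # unseen features implicitly have weight 1.0
        self.promotion = promotion
        self.demotion = demotion

    def score(self, feats):
        # Mean weight of the active features; 1.0 is the neutral value.
        if not feats:
            return 1.0
        return sum(self.weights.get(f, 1.0) for f in feats) / len(feats)

    def predict(self, feats):
        # True means "spam". Assumed threshold: mean weight above 1.0.
        return self.score(feats) > 1.0

    def update(self, feats, is_spam):
        """Train on one message; weights change only after a mistake.
        Returns True if an update was made."""
        if self.predict(feats) == is_spam:
            return False
        factor = self.promotion if is_spam else self.demotion
        for f in feats:
            self.weights[f] = self.weights.get(f, 1.0) * factor
        return True


if __name__ == "__main__":
    model = Winnow()
    spam = osb_features("buy cheap pills now".split())
    ham = osb_features("meeting agenda for monday".split())
    model.update(spam, True)
    model.update(ham, False)
    print(model.predict(spam), model.predict(ham))
```

Because weights are stored only for features that have been touched by an update, memory grows with the number of mistakes rather than with the full feature space, which is one reason this style of learner suits incremental training on large message streams.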
Similar Resources
Personalized Spam Filtering for Gray Mail
Gray mail, messages that could reasonably be considered either spam or good by different email users, is a commonly observed issue in production spam filtering systems. In this paper we study this class of mail using a large real-world email corpus and signature-based campaign detection techniques. Our analysis shows that even an optimal filter will inevitably perform unsatisfactorily on gray ma...
A Scalable Spam Filtering Architecture
The proposed spam filtering architecture for MTA servers is a component-based architecture that allows distributed processing and centralized knowledge. This architecture allows heterogeneous systems to coexist and benefit from a centralized knowledge source and filtering rules. MTA servers in the infrastructure contribute to a common knowledge, allowing for a more rational resource usage. The ...
Anti-Spam Grid: A Dynamically Organized Spam Filtering Infrastructure
The spam problem is getting worse all the time. In this paper, we propose Anti-Spam Grid, which can collaboratively filter spam messages by forming a virtual organization. We discuss the design of the fuzzy CopyRank and distributed Bayesian algorithms, and describe the architecture of Anti-Spam Grid. A detailed analysis shows that the system is reliable, efficient and scalable, and an experiment show...
A scalable intelligent non-content-based spam-filtering framework
Designing a spam-filtering system that can run efficiently on heavily burdened servers is particularly important to widely used email service providers (ESPs) (e.g., Hotmail, Yahoo, and Gmail), which have to deal with millions of emails every day. Two primary challenges these companies face in spam filtering are efficiency and scalability. This study is undertaken to develop an efficient and sc...
Using Biased Discriminant Analysis for Email Filtering
This paper reports on email filtering based on content features. We test the validity of a novel statistical feature extraction method, which relies on dimensionality reduction to retain the most informative and discriminative features from messages. The approach, named Biased Discriminant Analysis (BDA), aims at finding a feature space transformation that closely clusters positive examples whi...